Introduction


Geospatial analysis is the ingestion, manipulation, display, and modeling of geographic/spatial data. Spatial data can be represented explicitly as coordinates (latitude & longitude) or implicitly as addresses, Census tracts, or other identifiers.

This guide outlines the many tools in R for geospatial analysis and mapping. This is a long guide, so if you need something specific, we encourage you to scroll to the appropriate section using the Table of Contents on the left.

We hope this guide is a useful resource that helps you accomplish your existing research ideas or helps you come up with entirely new ideas. Please don’t hesitate to contact Aaron Williams () or Ajjit Narayanan () if you have any questions about this guide or need any assistance with R. And make sure to join our Slack channel, #r-users-group!

Rob Pitingolo () and the Urban Institute Mapping Users Group are also a fantastic resource to learn more about maps.

Why map with R?

R can have a steeper learning curve than point-and-click tools - like QGIS or ArcGIS - for geospatial analysis and mapping. But creating maps in R has many advantages including:

  1. Reproducibility: By creating maps with R code, you can easily share the outputs and the code that generated the output with collaborators, allowing them to replicate your work and catch errors easily.
  2. Iteration: With point-and-click software like ArcGIS, making 50 maps would be 50 times the work/time. But using R, we can easily make many iterations of the same map with a few changes to the code.
  3. Easy Updates: Writing code provides a roadmap for others (and future you!) to quickly update parts of the map as needed. Say for example a collaborator wanted to change the legend colors of 50 state maps. With R, this is possible in just a few seconds!
  4. An expansive ecosystem: There are several R packages that make it very easy to get spatial data, create static and interactive maps, and perform spatial analyses. This feature-rich package ecosystem is frankly unmatched by other programming languages and even point-and-click tools like QGIS and ArcGIS. Some of these R packages include:
  • sf: For managing and analyzing spatial dataframes
  • tigris: For downloading Census geographies
  • ggplot2: For making publication ready static maps
  • urbnmapr: For automatically adding Urban styling to static maps
  • mapview: For making exploratory interactive maps
  5. Cost: Most point-and-click tools for geospatial analysis are proprietary and expensive. R is free, open-source software. The software and most of its packages can be used for free by anyone for almost any use case.

Helpful Resources

Basic Concepts


library(sf)

The sf library is a key tool used for working with spatial data in R. sf stands for simple features (not San Francisco you Bay Area folks) and denotes a way to describe the spatial attributes of real life objects. The R object you will be working with most frequently for mapping in R is an sf dataframe. An sf dataframe is essentially a regular R dataframe, with a couple of extra features for use in mapping. These extra features exclusive to sf dataframes include:

  • sticky geometry columns
  • attached coordinate reference systems
  • some other spatial metadata

The most important of the above list is the sticky geometry column, which is a special column that contains all of the geographic information for each row of data. Say for example you had an sf dataframe of all DC census tracts. Then the geometry column would contain all of the geographic points used to define DC census tract polygons. The stickiness of this column means that no matter what data munging/filtering you do, you will not be able to drop or delete the geometry column. This is because if you did, all the geographic information would be lost and the object could no longer be an sf dataframe. Below is a graphic to help you understand this:

TODO: Insert Allison Horst Image

credits: @allisonhorst

Below is an example of an sf dataframe. Note that there is some spatial metadata which shows up as a header before the contents of the dataframe.

Since sf dataframes operate similarly to regular dataframes, we can use all our familiar tidyverse functions for data wrangling, including select, filter, rename, mutate, group_by and summarize. The sf package also has many functions that provide easy ways to replicate common tasks done in other GIS software like spatial joins, clipping, and buffering. Almost all of the mapping and geospatial analysis methods described in this guide rely on you having an sf dataframe. So now let’s talk about how to get one!
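As a minimal sketch of this (using the states data from library(urbnmapr), which is introduced below), note how the geometry column sticks around through dplyr verbs:

```r
library(sf)
library(tidyverse)
library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

# filter() and select() work just like on a regular dataframe...
dmv <- states %>%
  filter(state_abbv %in% c("DC", "MD", "VA")) %>%
  select(state_name)

# ...but the sticky geometry column is retained even though we
# did not select it
names(dmv)
```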

Importing spatial data

The first step to mapping in R is reading in your spatial data and ensuring it is an sf dataframe. R provides many helpful packages and functions to do this.

For states and counties

We highly recommend using the library(urbnmapr) package, which was created by folks here at Urban to easily create state and county level maps. The get_urbn_map() function allows you to read in spatial data on states and counties, with options to include territories. Importantly, it will also display AK and HI as insets on the map in accordance with the Urban Institute Data Visualization Style Guide. For information on how to install urbnmapr, see the GitHub repository.

Below is an example of how you would use urbnmapr to get an sf dataframe of all the states in the US.

library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

For tracts and other Census geographies

We recommend using the library(tigris) package, which allows you to easily download TIGER and other cartographic boundaries from the US Census Bureau. In order to automatically load the boundaries as sf objects, run options(tigris_class = "sf") once per R session.

library(tigris) has all standard census geographies, including census tracts, counties, CBSAs, ZCTAs, congressional districts, tribal areas, and more. It also includes other elements such as water, roads, and military bases.

By default, library(tigris) will download large, very detailed TIGER line boundary files. This means that the boundaries are not clipped to the shoreline and can be very large in size. For thematic mapping, cartographic boundary files are a better choice, as they are clipped to the shoreline, generalized, and therefore usually smaller in size without losing too much accuracy. To load cartographic boundaries, include the cb = TRUE argument. If you are doing detailed geospatial analysis and need the most detailed shapefiles, then you should use the detailed TIGER line boundary files and set cb = FALSE.

library(tigris) functions also allow you to load boundaries for previous years, though they will default to the most recent year.

Unlike library(urbnmapr), different functions are used to get geographic data for different geographic levels - for instance, the block_groups() function will load census block groups, and the tracts() function will load tract data. For the full list of supported geographies and functions, see the package vignette. Below is an example of how you would use library(tigris) to get an sf dataframe of all Census tracts in Maryland for 2015.

library(tigris)

# Only need to set once per script
options(tigris_class = "sf")

md_tracts <- tracts(state = "MD",
                    cb = TRUE,
                    year = 2015)

For folks interested in also pulling in Census demographic information along with Census geographies, we recommend checking out the sister package to library(tigris): library(tidycensus). That package allows you to download Census variables and Census geographic data simultaneously.

For boundaries outside the US

We recommend using the library(rnaturalearth) package, which is similar to library(tigris) but allows you to download and use boundaries beyond the US. Instead of setting the class to sf once per session as we did with library(tigris), you must set the returnclass = "sf" argument each time you use a function from the package. Below is an example of downloading an sf dataframe of all the countries in the world.

library(rnaturalearth)

world <- ne_countries(returnclass = "sf")

ggplot() +
  geom_sf(data = world, mapping = aes())

Your own files

Shapefiles/GeoJSONS

Shapefiles and GeoJSONs are two common spatial file formats you might find out in the wild. library(sf) has a function called st_read which allows you to easily read in these files as sf dataframes. The only required argument is dsn, or data source name. This is the filepath of the .shp file or the .geojson file on your local computer. It can also be a URL for a geojson file.

Bear in mind that most shapefiles are actually stored as multiple files (six in the example below) inside a folder. You will want to provide the file path to the file ending in .shp.

library(sf)
library(here)

list.files("mapping/data/shapefiles")
## [1] "Fire_Stations.cpg" "Fire_Stations.dbf" "Fire_Stations.prj"
## [4] "Fire_Stations.shp" "Fire_Stations.shx" "Fire_Stations.xml"
dc_firestations <- st_read(dsn = here("mapping", "data", "shapefiles", "Fire_Stations.shp"))
## Reading layer `Fire_Stations' from data source `/Users/anarayanan/Documents/GitHub/r-at-urban/mapping/data/shapefiles/Fire_Stations.shp' using driver `ESRI Shapefile'
## Simple feature collection with 44 features and 22 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -77.09353 ymin: 38.82054 xmax: -76.9334 ymax: 38.97343
## Geodetic CRS:  WGS 84
ggplot() +
  geom_sf(dc_firestations, mapping = aes())

st_read also supports a wide variety of other spatial file formats, including geodatabases, .KML files, and many others. For a full list, please see the function help page.
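For instance, to read a layer out of a geodatabase, you can list the available layers with st_layers() and then pass the layer argument to st_read(). The file path and layer name below are hypothetical placeholders:

```r
library(sf)

# List the layers available in a geodatabase (hypothetical path)
st_layers(dsn = "mapping/data/my_geodatabase.gdb")

# Read in one layer by name (hypothetical layer name)
my_layer <- st_read(dsn = "mapping/data/my_geodatabase.gdb",
                    layer = "my_layer_name")
```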

CSVs or dataframes with lat/lons

When you have latitudes/longitudes

library(sf) contains the st_as_sf() function for converting existing dataframes that have columns for longitude and latitude into an sf dataframe. The two arguments you must specify for this function are:

  • coords: the names of the columns corresponding to longitude and latitude (in that order!)
  • crs: The CRS (coordinate reference system) for your longitude/latitude coordinates. It is very important that you set the right CRS when converting your data, otherwise your data will not show up where you expect it to. This means you might have to scour the data documentation to figure out which CRS is used for your longitude/latitude columns. In case you can’t find any documentation on the appropriate CRS, a good starting point is CRS 4326, or the WGS 84 projection. This is the most popular CRS used for longitude/latitude. To learn more about CRS, see this section. (TODO: INSERT CRS LINK)

One common pitfall: before converting to an sf dataframe, you must drop any rows that have NA values for latitude or longitude. If your data contains NA values, then the st_as_sf() function will throw an error. Below is an example of reading in data from a CSV and converting it to an sf dataframe.

library(sf)
library(tidyverse)

state_capitals <- read_csv("mapping/data/state-capitals.csv")

state_capitals_sf <- state_capitals %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = 4326)
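As a sketch of the NA-dropping step mentioned above (using a small made-up dataframe for illustration):

```r
library(sf)
library(tidyverse)

# Made-up data with one missing longitude
locations <- tibble(
  name      = c("a", "b", "c"),
  longitude = c(-77.01, NA, -76.61),
  latitude  = c(38.90, 39.29, 39.28)
)

locations_sf <- locations %>%
  # st_as_sf() errors on NA coordinates, so drop those rows first
  filter(!is.na(longitude), !is.na(latitude)) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
```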

Appending spatial info to your data

Oftentimes, the data you are working with will just have state or county identifiers and will not have any geographic information. In this case, you must do the extra work of downloading the geographic data as an sf dataframe and then joining your non-spatial data to the spatial data. Generally this involves 3 steps:

  1. Reading in your own data as a data frame
  2. Reading in the geographic data as an sf dataframe
  3. Using left_join to merge the geographic data with your own non spatial data and create a new expanded sf dataframe

Let’s say we wanted an sf dataframe of CHIP enrollment by state, but all we had was data on CHIP enrollment and state abbreviations. The below code chunk shows how we add the spatial data to our original dataframe and turn it into an sf dataframe.

# load necessary packages
library(tidyverse)
library(urbnmapr)
library(urbnthemes)
library(janitor)

set_urbn_defaults(style = "map")

# read the state CHIP data
# replace with project filepath to your CSV below
chip <- read_csv("mapping/data/chip-enrollment.csv") %>% 
  # clean column names so there are no random spaces/uppercase letters
  janitor::clean_names()

# read in state geographic data from urbnmapr
states <- get_urbn_map(map = "states", sf = TRUE)


# left join state geographies to state_chip_data
chip_with_geographies <- left_join(states, chip,
                                   # Specify the join columns, which are named
                                   # slightly differently in states and chip
                                   # respectively
                                   by = c("state_abbv" = "state_abbreviation"))

chip_with_geographies %>% 
  select(state_fips, state_abbv, chip_enrollment)
## Simple feature collection with 51 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2600000 ymin: -2363000 xmax: 2516374 ymax: 732352.2
## Projected CRS: US National Atlas Equal Area
## First 10 features:
##    state_fips state_abbv chip_enrollment                       geometry
## 1          01         AL          150040 MULTIPOLYGON (((1150023 -15...
## 2          04         AZ           88224 MULTIPOLYGON (((-1386136 -1...
## 3          08         CO          167227 MULTIPOLYGON (((-786661.9 -...
## 4          09         CT           25551 MULTIPOLYGON (((2156197 -83...
## 5          12         FL          374884 MULTIPOLYGON (((1953691 -20...
## 6          13         GA          232050 MULTIPOLYGON (((1308636 -10...
## 7          16         ID           35964 MULTIPOLYGON (((-1357097 78...
## 8          18         IN          114927 MULTIPOLYGON (((1042064 -71...
## 9          20         KS           79319 MULTIPOLYGON (((-174904.2 -...
## 10         22         LA          161565 MULTIPOLYGON (((1075669 -15...

A common pitfall when doing the left_join above is that some records may be missing from either the left hand side dataframe (in this case states) or the right hand side dataframe (in this case chip). This can happen if a few states are missing in your data, or your data has misspelled states that weren’t matched.
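One way to spot those unmatched records (assuming the states and chip dataframes from the chunk above) is dplyr::anti_join(), which returns the rows in one dataframe that have no match in the other:

```r
library(tidyverse)
library(sf)

# States in the spatial data with no matching row in the CHIP data
# (an empty result means every state matched)
states %>%
  st_drop_geometry() %>%
  anti_join(chip, by = c("state_abbv" = "state_abbreviation"))
```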

We highly recommend using library(tidylog) and the tidylog::left_join() function - instead of the regular left_join function - to print additional information on the number of rows that are unmatched, or rows only in the left hand side/right hand side dataframes. This will help you identify errors and mismatches between datasets early on in your data ingestion process. For example, if we want to append population counts to each of the states using another data source:

state_pops <- 
  read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv") %>% 
  filter(ages == "total", year == "2010") %>% 
  select(state_abbv   = `state/region`, population)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `state/region` = col_character(),
##   ages = col_character(),
##   year = col_double(),
##   population = col_double()
## )
# Join on population counts, printing diagnostics about the join
chip_with_geographies <- chip_with_geographies  %>% 
  # Specify left_join from tidylog to print messages
  # Don't library(tidylog) unless you want all dplyr functions to print out
  # status messages. We recommend using tidylog:: explicitly whenever you want
  # to print out status messages
  tidylog::left_join(state_pops, by = "state_abbv")
## left_join: added one column (population)
##            > rows only in x    0
##            > rows only in y  ( 2)
##            > matched rows     51
##            >                 ====
##            > rows total       51

Coordinate Reference Systems

What are they?

Coordinate reference systems (CRS) specify the 3d shape of the earth used in a spatial dataset and, optionally, how we project that 3d shape onto a 2d surface. They are an important part of working with spatial data, as you need to ensure that all the data you are working with are in the same CRS in order for spatial operations and maps to be accurate.

To visualize how coordinate reference systems work, imagine that the earth is an orange. Now peel the skin off the orange and try to flatten it. There are many ways to do it, but all will create distortions of some kind. The CRS gives us the formula we’ve used to specify the shape of the orange (usually a sphere or ellipsoid of some kind) and, optionally, how we flattened the orange into 2d.

Broadly, there are two kinds of Coordinate Reference Systems:

  1. Geographic coordinate systems (sometimes called unprojected coordinate systems)

    • Specifies a 3d shape for the earth
    • Uses a spheroid/ellipsoid surface to approximate shape of the earth
    • Usually use decimal degree units (ie latitude/longitude) to identify locations on earth

  2. Projected coordinate systems
    • Specifies a 3d shape for the earth + a 2d mapping
    • Is a geographic coordinate system + a projection
    • projection: mathematical formula used to convert a 3d coordinate system to a 2d flat coordinate system
    • Many different kinds of projections, including Equal Area, Equidistant, Conformal
    • All projections distort the true shape of the earth in some way, either in terms of shape, area, or angle. (Required xkcd comic: https://xkcd.com/977/)
    • Usually use linear units (ie feet, meters) and therefore useful for distance based spatial operations (ie creating buffers)

CRS can be specified either by name (ie Maryland State Plane) or EPSG identifier (ie 2248). The EPSG identifier is a numeric code that uniquely identifies a coordinate reference system.

EPSG stands for the now defunct European Petroleum Survey Group. While oil companies have generally been terrible for the earth, the one nice thing they did was set up common standards for coordinate reference systems.

Finding the CRS

If you are lucky, your data will have embedded CRS data that will be automatically detected when the file is read in. This is usually the case for GeoJSONS (.geojson) and shapefiles (.shp). When you use st_read() on these files, you should see the CRS displayed in the metadata:

  knitr::include_graphics(here("mapping", "www", "images", "sf_crs_pic.png"))

Sometimes however, this CRS field will be blank or be NA as the dataset did not specify the CRS. In that case you must find and set the CRS for your data before proceeding with analysis. Below are some good rules of thumb for finding out what CRS your data is in:

  • For geojsons, the CRS should always be 4326 (or WGS 84). The official geojson specification states that this is the only valid CRS for geojsons, but in the wild, this may not be true 100% of the time.

  • For shapefiles, there should be a file that ends in .prj in the same directory as the .shp file. This file contains the projection information for that file.

TODO: Put in proj2crs site.

  • For CSV’s with latitude/longitude columns, the CRS is usually 4326 (or WGS 84).

  • For all other file types, you should scan the metadata and any accompanying documentation to see if the coordinate reference system for geographic data is specified

Setting the CRS

The sf package has a number of handy functions for identifying, setting, or transforming the CRS for a spatial dataset.
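As a sketch of those functions (using the dc_firestations data read in earlier; note that st_set_crs() should only be used when the CRS is missing, since it labels the existing coordinates rather than transforming them):

```r
library(sf)

# Identify the current CRS of an sf dataframe
st_crs(dc_firestations)

# Set the CRS when it is missing/NA. This only labels the existing
# coordinates - it does not change them
dc_firestations <- st_set_crs(dc_firestations, 4326)

# Transform (reproject) the coordinates into another CRS,
# eg Maryland State Plane (EPSG 2248)
dc_firestations_projected <- st_transform(dc_firestations, 2248)
```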

TODO: Add info on CRS

Mapping

In order to start mapping, you need an sf dataframe. If you don’t have one, see the Basic Concepts section above.

When should you map?

Before you make a map, we encourage you to read this snippet from the Urban Institute Data Visualization Style Guide:

Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.

The best advice we can give you is to first think about the point you’re trying to get across to your readers and then decide whether a map will allow you to most effectively do that. For example, if you really want readers to understand that there is a large gap between the top 5 states with high housing costs and the rest of the states, perhaps you are better off using an ordered bar chart or an ordered table. But if you want readers to understand that there is a clear geographic pattern of coastal states having higher housing costs, perhaps a map would be the appropriate visualization.

Below are some helpful blog posts which might help you answer these deeper questions:

  • When Maps Shouldn’t be Maps
  • How a map’s “bins” can influence your perception of an important policy issue

The basics

library(ggplot2)

Most mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). Hadley Wickham’s ggplot2 is based on Leland Wilkinson’s The Grammar of Graphics and Wickham’s A Layered Grammar of Graphics. The layered grammar of graphics is a structured way of thinking about the components of a plot, which correspond to the arguments and functions used in ggplot2 and ultimately mapping. The main components of a plot (and a map) are the following:

  • Data are what are visualized in a plot and aesthetic mappings are directions for how data are mapped in a plot in a way that can be perceived by humans (ie color, transparency, shape, etc)
  • Geoms are representations of the actual data like points, lines, and bars.
  • Stats are statistical transformations that represent summaries of the data like histograms.
  • Scales map values in the data space to values in the aesthetic space. Scales draw legends and axes.
  • Coordinate Systems describe how geoms are mapped to the plane of the graphic.
  • Facets break the data into meaningful subsets like small multiples.
  • Themes control the finer points of a plot such as fonts, font sizes, and background colors.

If this sounds like a lot, don’t worry! We usually only make use of a couple of these properties explicitly when making a map and a lot of these properties are simplified for maps. For example, geom_sf is the main geom used for mapping in ggplot2. And since sf objects have a column named geometry that contains the geographic information needed for the map, you usually do not need to specify any aesthetic mappings. We promise, it’s easier than it looks.

A simple map

To make a simple map, all you need to do is add a geom_sf() to a ggplot and set the data argument equal to an sf dataframe. Say for example you wanted to make a map of all 50 states using data from library(urbnmapr):

library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

ggplot() +
  geom_sf(data = states, mapping = aes())

Note that we set no aes mappings as we currently do not want to apply any mappings (like coloring in the states by a variable or allowing the transparency to vary by a variable). We will cover how to set such aesthetic mappings later.

Styling

library(urbnthemes)

library(urbnthemes) has functions that automatically style maps in accordance with the Urban Institute Data Visualization Style Guide. By using library(urbnthemes), you can create publication-ready maps that you can immediately drop into Urban research briefs or blog posts.

To install urbnthemes, visit the package’s GitHub repository.

You can use the automated theming functions in two ways: set the map defaults once per script, or add theme_urbn_map() to the end of each of your created maps. These functions will get rid of unnecessary axes, labels, and gridlines that are useful for charts, but not needed for maps.

library(urbnthemes)

# You can either run this once per script to automatically style all maps with
# the Urban theme
set_urbn_defaults(style = "map")

# Or you can add `+ theme_urbn_map()` to the end of every plot you make
ggplot() +
  geom_sf(states, mapping = aes()) +
  theme_urbn_map()

Layering

Just like in other GIS software, you can layer multiple geometries on top of each other using the + operator from library(ggplot2). The shapes will appear from bottom to top (ie the last mapped object will show up on top). It is important that all layers are in the same CRS (coordinate reference system).

state_capitals <- read_csv("mapping/data/state-capitals.csv") %>% 
  # We filter out AK and HI for now as mapping points in AK and HI can get 
  # tricky when they are insets
  filter(!state %in% c("Alaska", "Hawaii")) %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = 4326) %>% 
  st_transform(crs = 2163)

ggplot() +
  geom_sf(data = states %>% 
              # We filter out AK and HI for now as mapping points in AK and HI 
              # can get tricky when they are insets
            filter(!state_abbv %in% c("AK", "HI")), 
          mapping = aes()) +
  geom_sf(data = state_capitals, mapping = aes(),
          color = "#1696d2", size = 2.0) +
  theme_urbn_map()

Fill and Outline Colors

The same commands used to change colors, opacity, lines, size, etc. in charts can be used for maps too. For instance, if I wanted to change the colors of the map above, I would simply use the fill = and color = parameters in geom_sf(). fill will change the fill color of polygons; color will change the color of polygon outlines, lines, and points.

library(urbnthemes) contains helper variables (like palette_urbn_main) for accessing color palettes from the Urban Data Viz Style guide.

ggplot() +
  geom_sf(states, mapping = aes(),
          fill = palette_urbn_main['magenta'], 
          color = "white") +
  theme_urbn_map()

Adding text

You can add text and labels to your maps with geom_sf_text() and geom_sf_label(), which work like geom_text() and geom_label() but place the text at the location of each geometry.
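As a quick sketch (using the states data from library(urbnmapr)), geom_sf_text() draws text at the location of each geometry via the label aesthetic:

```r
library(tidyverse)
library(urbnmapr)
library(urbnthemes)

states <- get_urbn_map("states", sf = TRUE)

# Label each state with its abbreviation, drawn at the state's location
ggplot() +
  geom_sf(data = states, mapping = aes()) +
  geom_sf_text(data = states, mapping = aes(label = state_abbv), size = 2) +
  theme_urbn_map()
```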

Urban Institute style

The Urban Institute Data Visualization Style Guide is the best place to start when making any visualization. The guide offers some blunt suggestions for maps:

Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.

That said, some of the most-convincing data visualizations are maps. The style guide has well-defined guidelines for standard chart and graph types, but some of the more complex or novel visualizations below will need to be styled and tweaked on a case-by-case basis. For external-facing products (presentations, publications, web interactives, etc.) you will likely need to collaborate with Urban’s Communications team. Please contact Ben Chartoff () with questions or thoughts related to visual presentation, or discuss with the editorial team in the course of editing your publication/product.

Here are a few things to keep in mind:

  • The following maps are styled by the Urban Institute package urbnthemes.
  • U.S. dot maps and U.S. choropleths use the Albers Equal-Area Conic Projection. Tile-based maps like tile grid maps and geofaceting use the Mercator projection.
  • Maps that show the magnitude of a variable use the blue sequential ramp and maps that display positives and negatives use the diverging color ramp.
  • In general, the borders between geographies are white.

Choropleth Maps

Choropleth maps display geographic areas with shades, colors, or patterns in proportion to a variable or variables. Choropleth maps can represent massive geographies like the entire world and small geographies like Census Tracts. To make a choropleth map where you shade in regions in proportion to a variable, you can set an aesthetic mapping inside the aes argument of geom_sf.

When we set an aesthetic mapping, we are telling R that we want some aspect of the plot to change in accordance with a variable in our dataset. And because we want the fill color to vary, we use the fill aesthetic.

Continuous color scale

Let’s say we wanted to make a choropleth map of states colored in by CHIP enrollment as a percentage of total population (ie the chip_enrollment variable divided by the population variable in the chip_with_geographies dataframe). Below is the code we would use:

library(tidyverse)
library(sf)

# Calculate the chip enrollment percentage and append as a column
chip_with_geographies <- chip_with_geographies %>% 
  mutate(chip_pct = chip_enrollment/population)

# make choropleth map
ggplot() + 
  geom_sf(# data has to be an sf dataframe
          data = chip_with_geographies, 
          # we set the fill aesthetic to a column for choropleth maps
          aes(fill = chip_pct),
          # Set white boundaries
          color = "white") + 
  theme_urbn_map() +
  # Display legend labels as percentages. Note we use scale_fill_continuous
  # because we want a continuous legend
  scale_fill_continuous(labels = scales::percent)

Discrete color scale

You can also have discrete color bins (ie 0-1%, 1-2%, etc) for the legend. Be careful when doing this as changing the number or size of bins can drastically change how the map looks.

# make choropleth map with discrete buckets

# Discretize continuous chip_pct variable
chip_with_geographies_binned <- chip_with_geographies %>% 
  # Add two rows where chip_pct = 0 and 1 in order to have cut_interval
  # generate bins that include 0 and 1, as they aren't in the data explicitly
  add_row(chip_pct = 0) %>% 
  add_row(chip_pct = 1) %>% 
  mutate(chip_pct = cut_interval(chip_pct, n = 3))

ggplot() + 
  geom_sf(data = chip_with_geographies_binned, 
          # we set the fill aesthetic to a column for choropleth maps
          aes(fill = chip_pct)) + 
  theme_urbn_map()

When mapping, you need to be very careful when setting color breaks for legends, as slightly different breaks can generate very different-looking maps.

Interactive Maps

Interactive maps can be a great tool to explore and understand your data. And luckily there are a lot of new R packages that make it really easy to create them.

library(mapview)

library(mapview) is probably the most user friendly of the interactive mapping R libraries. All you have to do to create an interactive map is:

library(mapview)

mapview(chip_with_geographies)

A couple of things to note about this interactive map:

  • When you click on an object, you get a popup table of all its attributes
  • When you hover over an object, you get a popup with an object id
  • The default basemap used is from Carto

Each of the above behaviors can be changed if desired. As you’ll see in the below section, the syntax for library(mapview) is significantly different from library(ggplot2), so be careful!

Coloring in points/polygons

In order to create a choropleth map where we color in the points/polygons by a variable, we need to feed in a column name in quotes to the zcol argument inside the mapview() function:

# Create interactive state map colored in by chip enrollment,
mapview(chip_with_geographies, zcol = "chip_enrollment")

If you want more granular control over the color palette for the legend, you can also feed in a vector of color hex codes to col.regions along with a column name to zcol. This will create a continuous color range along the provided colors. Be careful though, as the color interpolation is not perfect.

library(RColorBrewer)
mapview(chip_with_geographies, 
        col.regions = c(palette_urbn_green[6], 
                        "white", 
                        palette_urbn_cyan[6]),
        zcol = "chip_enrollment") 

If you want to color in all points/polygons as the same color, just feed in a single color hex code to the col.regions argument:

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) 

Adding layers

You can add multiple sf objects on the same map by using the + operator. This is very useful when comparing 2 or more spatial datasets.

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) +
  mapview(state_capitals, col.regions = palette_urbn_cyan[5])

You can even create slider maps by using the | operator!

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) |
  mapview(state_capitals, col.regions = palette_urbn_cyan[5])

More details

To learn more about more advanced options with mapview maps, check out their documentation page and the reference manual.

There are also other interactive map making packages in R like leaflet (which mapview is a more user friendly wrapper of), tmap, and mapdeck. To learn about these other packages, this book chapter is a good starting point.

Dot Maps

Tile Grid Maps

Geofacetting

Cartograms

Spatial Operations

Calculating Distance

Spatial Joins

Cropping

Aggregating

Drive/Transit times

Geocoding

Geocoding is the process of turning text (usually addresses) into geographic coordinates (usually latitudes/longitudes) for use in mapping. For Urban researchers, we highly recommend using the Urban geocoder as it is fast, accurate, designed to work with sensitive/confidential data and most importantly free to use for Urban researchers! To learn about how we set up and chose the geocoder for the Urban Institute, you can read our Data@Urban blog.

Cleaning Addresses

The single most important factor in getting accurate geocoded data is having clean, well structured address data. This can prove difficult as address data out in the wild is often messy and unstandardized. While the rules for cleaning addresses are very data specific, below are some examples of clean addresses you should aim for in your data cleaning process:

library(gt)
library(tibble)

cleaned_address_table <- tribble(
  ~"f_address", ~"Type of address",
  "123 Troy Drive, Pillowtown, CO, 92432", "residential address",
  "789 Abed Avenue, Apt 666, Blanketsburg, CO, 92489", "residential apartment address",
  "Shirley Boulevard and Britta Drive, Blanketsburg, CO, 92489", "street intersection",
  "Pillowtown, CO", "city",
  "92489, CO", "zip code"
)

gt(cleaned_address_table) %>% 
  # tab_header(title = md("Clean Address Examples")) %>% 
  opt_row_striping(row_striping = TRUE) %>% 
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels(
      columns = vars(f_address, `Type of address`)
    )
  ) %>% 
  opt_align_table_header(align = "left") %>% 
  tab_options(
    container.width = "100%",
    container.height = "400px",
    # column_labels.background.color = palette_urbn_cyan[1],
    table.border.top.width = 0,
    table.border.bottom.width = 0,
    column_labels.border.bottom.width = 0
  )

All that being said, our geocoder is pretty tolerant of different address formats, typos/spelling errors, and missing states, zip codes, etc., so don’t spend too much time cleaning every address in the data. Also note that while our geocoder is able to geocode cities and zip codes, it will return the lat/lon of the center of the city/zip code, which may not be what you want.
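A few common cleaning steps can be handled with stringr. This is just a sketch (the sample data and the specific replacement rules are made up; the right rules will depend on your data):

```r
library(dplyr)
library(stringr)

# Hypothetical data frame of messy addresses
raw <- tibble::tibble(
  f_address = c("  123 troy  drive, pillowtown, CO, 92432 ",
                "789 ABED AVE, APT 666, BLANKETSBURG, CO, 92489")
)

cleaned <- raw %>%
  mutate(
    f_address = str_squish(f_address),              # trim and collapse whitespace
    f_address = str_to_title(f_address),            # consistent casing
    f_address = str_replace(f_address, ", Co,", ", CO,"),  # restore state abbreviation
    f_address = str_replace_all(f_address, "\\bAve\\b", "Avenue")  # expand abbreviations
  )
```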

Instructions

To use the Urban geocoder, you will need to:

  1. Generate a CSV with a column named f_address which contains the addresses in single line format (i.e., 123 Abed Avenue, Blanketsburg, CO, 94328). This means that if you have the addresses split across multiple columns (i.e., Address, City, State, and Zip columns), you will need to concatenate them into one column. Also see our Address cleaning section above.

  2. Go to the Urban geocoder and answer the initial questions. This will tell you whether your data is non-confidential or confidential data, and allow you to upload your CSV for geocoding.

  3. Wait for an email telling you your results are ready. If your data is non-confidential, this email will contain a link to your geocoded results. This link expires in 24 hours, so make sure to download your data before then. If your data is confidential, the email will contain a link to the location on the Y Drive where your confidential geocoded data is stored. You can specify this output folder when submitting the CSV in step 1.
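If your addresses are split across columns, step 1 might look something like the following sketch (the column names Address, City, State, and Zip and the output file name are assumptions):

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical data with addresses split across columns
addresses <- tibble::tibble(
  Address = "123 Abed Avenue",
  City    = "Blanketsburg",
  State   = "CO",
  Zip     = "94328"
)

# Concatenate into a single f_address column and write out a CSV
# ready for upload to the Urban geocoder
addresses %>%
  unite("f_address", Address, City, State, Zip, sep = ", ") %>%
  write_csv("addresses_for_geocoding.csv")
```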

Geocoder outputs

The geocoded file will contain your original data, plus a few new columns (including latitude and longitude). The table below describes each of the new columns appended to your original data. It’s very important that you take a look at the Addr_type column in the CSV before doing further analysis to check the accuracy of the geocoding process.

Column Description
Match_addr The actual address that the inputted address was matched to. This is the address that the geocoder used to get latitudes/longitudes. If there are potentially many typos or non-standard address formats in your data file, you will want to take a close look at this column to confirm that the matched address correctly handled typos and badly formatted addresses.
Longitude The WGS 84 datum longitude (EPSG code 4326)
Latitude The WGS 84 datum latitude (EPSG code 4326)
Addr_type The match level for a geocode request. This should be used as an indicator of the precision of geocode results. Generally, Subaddress, PointAddress, StreetAddress, and StreetInt represent accurate matches. The list below contains all possible values for this field. Green values represent high accuracy matches, yellow represents medium accuracy matches, and red represents low accuracy/inaccurate matches. If you have many yellow and red values in your data, you should manually check the results before proceeding with analysis. All possible values:

Subaddress: A subset of a PointAddress that represents a house or building subaddress location, such as an apartment unit, floor, or individual building within a complex. The UnitName, UnitType, LevelName, LevelType, BldgName, and BldgType field values help to distinguish subaddresses which may be associated with the same PointAddress. Reference data consists of point features with associated house number, street name, and subaddress elements, along with administrative divisions and optional postal code; for example, 3836 Emerald Ave, Suite C, La Verne, CA, 91750.

PointAddress: A street address based on points that represent house and building locations. Typically, this is the most spatially accurate match level. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X / Y (Longitude/Latitude) and geometry output values for a PointAddress match represent the street entry location for the address; this is the location used for routing operations. The DisplayX and DisplayY values represent the rooftop, or actual, location of the address. Example: 380 New York St, Redlands, CA, 92373.

StreetAddress: A street address that differs from PointAddress because the house number is interpolated from a range of numbers. Reference data contains street centerlines with house number ranges, along with administrative divisions and optional postal code information, for example, 647 Haight St, San Francisco, CA, 94117.

StreetInt: A street address consisting of a street intersection along with city and optional state and postal code information. This is derived from StreetAddress reference data, for example, Redlands Blvd & New York St, Redlands, CA, 92373.

StreetName: Similar to a street address but without the house number. Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code, for example, W Olive Ave, Redlands, CA, 92373.

StreetAddressExt: An interpolated street address match that is returned when parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment.

DistanceMarker: A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. Example: Carr 682 KM 4, Barceloneta, 00617.

PostalExt: A postal code with an additional extension, such as the United States Postal Service ZIP+4. Reference data is postal code points with extensions, for example, 90210-3841.

POI: Points of interest. Reference data consists of administrative division place-names, businesses, landmarks, and geographic features, for example, Golden Gate Bridge.

Locality: A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. Example: Bogotá, COL.

PostalLoc: A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries, for example, 7132 Frauenkirchen.

Postal: Postal code. Reference data is postal code points, for example, 90210 USA.
Score A number from 1–100 indicating the degree to which the input tokens in a geocoding request match the address components in a candidate record. A score of 100 represents a perfect match, while lower scores represent decreasing match accuracy.
Status Indicates whether a batch geocode request results in a match, tie, or unmatched. Possible values include:

M - Match. The returned address matches the input address and is the highest scoring candidate.

T - Tied. The returned address matches the input address but has the same score as one or more additional candidates.

U - Unmatched. No addresses match the inputted address.
geometry The WKT (well-known text) representation of the latitude/longitude point. This column may be useful if you’re reading the CSV into R, Python, or ArcGIS.
Region The state that Match_addr is located in
RegionAbbr Abbreviated state name. For example, CA for California
Subregion The county that the input address is located in
MetroArea The name of the metropolitan area that Match_addr is located in. This field may be blank if the input address is not located within a metro area.
City The city that Match_addr is located in
Nbrhd The neighborhood that Match_addr is located in. Note these are ESRI-defined neighborhoods, which may or may not align with other sources’ neighborhood definitions.
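Once the results are back, the accuracy check described above might look something like this sketch (the file name is an assumption; the Addr_type values kept here are the high-accuracy matches listed above):

```r
library(dplyr)
library(readr)
library(sf)

geocoded <- read_csv("geocoded_results.csv")

# Distribution of match types: many StreetName, Postal, or Locality
# rows means you should manually inspect results before analysis
count(geocoded, Addr_type, sort = TRUE)

# Keep only high-accuracy matches and convert the WKT geometry
# column into an sf object for mapping
high_accuracy <- geocoded %>%
  filter(Addr_type %in% c("Subaddress", "PointAddress",
                          "StreetAddress", "StreetInt")) %>%
  st_as_sf(wkt = "geometry", crs = 4326)
```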

While interactive maps are a great tool to explore and understand your data, we do not recommend using them in Urban research briefs or blog posts. They are unfortunately pretty difficult to style in compliance with the Data Viz Style Guide. There are also legal concerns with some of the common basemaps used in interactive maps: many are not open source, and their terms of use prohibit them from being used for research purposes.

